Refactor stats to use atomics #375
Conversation
Hey @magec , the PR looks amazing, thank you so much for writing it. I've tagged @drdrsh to review it because he's running PgCat in production at Instacart, so I wanted to make sure that the changes looked good to him as well. On another note, I think it would be great for us 3 (the main production users of PgCat that I know of) to open up a more direct communication channel (e.g. Slack, Discord, whatever), so we can:
What do you think?
Good to merge! Let me know what the production metrics look like! Our main concern is the speed of atomics vs. tokio channels and the accuracy of the new stats implementation (it took us a few tries to get the stats right the first time, because our understanding of them differed from PgBouncer's, for which we were writing a drop-in replacement).
Hey! Yes, another communication channel would be great; Slack works for me. As for the PR, I have noticed that we are leaking connections in |
When we are dealing with a high number of connections, generated stats cannot be consumed fast enough by the stats collector loop. This makes the stats subsystem inconsistent, and a lot of warning messages are thrown due to unregistered servers/clients. This change refactors the stats subsystem to use atomics:
- Counters are now handled using u64 atomics.
- The event system is dropped, and averages are calculated by a loop every 15 seconds.
- Instead of snapshots being generated every second, we now keep track of the servers/clients that have registered. Each pool/server/client has its own instance of the counters and makes changes directly, instead of adding an event that gets processed later.
@magec So we ended up settling on Discord for our public chat platform. You can join here: https://discord.com/invite/DmyJP3qJ7U (you can also invite anyone you want, it's public and free for anyone). There is a channel called |
So, in the end I was testing badly 🤦. In the test I spawn a ... Funny thing (I didn't know): psql also forks, and the socket is opened in the child, so I was killing the parent, and thus the socket didn't disconnect because the child was left behind. Now I execute a ... After that, I could see that I do receive the event of the client disconnecting, and I could detect where the issue was, which was only stats-related in the end. That said, this is ready to be merged.
Note that the system now has some contention when registering/deregistering clients, since we need to write to a hash map behind a lock. Given that this only happens on registration/deregistration, I assumed performance won't be compromised, especially since we no longer copy snapshots every second.
Test coverage was good enough, so I didn't add new Ruby tests here.
I had to modify one test because maxwait is not deleted every second now. Do we still need that behavior?